Abstract

Background: Protein complexes play important roles in biological systems such as gene regulatory networks and
metabolic pathways. Most methods for predicting protein complexes try to find protein complexes with size more
than three. It is known, however, that smaller protein complexes account for a large part of all known
complexes in several species. In our previous work, we developed a method with several feature space mappings
and the domain composition kernel for prediction of heterodimeric protein complexes, which outperforms existing
methods.

Results: We propose methods for prediction of heterotrimeric protein complexes by extending techniques in the 
previous work on the basis of the idea that most heterotrimeric protein complexes are not likely to share the same 
protein with each other. We make use of the discriminant function in support vector machines (SVMs), and design 
novel feature space mappings for the second phase. As the second classifier, we examine SVMs and relevance 
vector machines (RVMs). We perform 10-fold cross-validation computational experiments. The results suggest that 
our proposed two-phase methods and SVM with the extended features outperform the existing method NWE, 
which was reported to outperform other existing methods such as MCL, MCODE, DPClus, CMC, COACH, RRW, and
PPSampler for prediction of heterotrimeric protein complexes. 

Conclusions: We propose two-phase prediction methods with the extended features, the domain composition 
kernel, SVMs and RVMs. The two-phase method with the extended features and the domain composition kernel 
using SVM as the second classifier is particularly useful for prediction of heterotrimeric protein complexes. 



Background 

Identifying a set of proteins as a functional protein complex
is essential for understanding molecular systems in
living cells. Some proteins form a complex and work as
a transcription factor, whereas other types of
proteins work as enzymes. Hence, identifying the proteins
that constitute such transcription factors is useful
for uncovering gene regulatory networks and metabolic
pathways. Many computational methods have been
developed for predicting protein complexes from
protein-protein interaction networks [1,2]. Enright et al.
developed the Markov cluster (MCL) algorithm [3],
which repeatedly applies two operators, called expansion
and inflation, to a matrix whose elements represent the
transition probability from one protein to another. The
expansion operation takes the power of the matrix, and
the inflation operation takes the Hadamard power of the
matrix. MCL is fast and efficient because of these
operations. Macropol et al. developed the repeated random
walks (RRW) method [4], which iteratively expands a 
cluster depending on the probabilities in steady states of 
random walks with restarts. Maruyama and Chihara 
improved the RRW method by weighting the restart 
probabilities and proposed the node-weighted expansion 
(NWE) method [5]. Bader and Hogue developed the 
molecular complex detection (MCODE) method [6], 
which uses a modified clustering coefficient defined by 
edge density in a subset of the original and adjacent 
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vertices to find densely connected regions. King et al. 
developed the restricted neighborhood search clustering 
(RNSC) method [7], which selects clusters generated by a 
cost function according to the cluster size, density and 
functional homogeneity. Altaf-Ul-Amin et al. developed 
DPClus [8], which tries to find densely connected 
regions. Chua et al. developed the protein complex pre- 
diction (PCP) method [9], which finds maximal cliques 
using the functional similarity weight based on indirect 
interactions. Liu et al. developed the clustering based on 
maximal cliques (CMC) method [10], which generates all 
maximal cliques from the protein-protein interaction 
networks, and assembles highly overlapped clusters based 
on their interconnectivity. Wu et al. developed the
core-attachment based (COACH) method [11]. Most of these methods
focus on finding densely connected subgraphs in
protein-protein interaction networks. Hence, they are
considered to have difficulty detecting small protein
complexes because, for instance, the edge density of two
interacting proteins is always 1.0 even if the proteins do
not form a complex.
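The edge-density argument above can be checked on a toy network. The following is a minimal sketch, assuming the usual definition of edge density (number of edges present among a set of vertices divided by the number of possible pairs); the function name `edge_density` and the three-protein network are hypothetical and not taken from the cited methods.

```python
from itertools import combinations


def edge_density(nodes, edges):
    """Fraction of possible protein pairs among `nodes` joined by an edge."""
    nodes = set(nodes)
    if len(nodes) < 2:
        return 0.0
    present = sum(
        1 for u, v in combinations(sorted(nodes), 2)
        if frozenset((u, v)) in edges
    )
    possible = len(nodes) * (len(nodes) - 1) // 2
    return present / possible


# Hypothetical toy network: A-B and B-C interact, A-C do not.
edges = {frozenset(("A", "B")), frozenset(("B", "C"))}
pair_density = edge_density(["A", "B"], edges)         # any interacting pair
triple_density = edge_density(["A", "B", "C"], edges)  # a path of three proteins
```

Any interacting pair has density 1.0 by construction, so density alone cannot separate true heterodimers from arbitrary interacting pairs, while a set of three proteins can already take different density values.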

However, small protein complexes account for a
large part of all known protein complexes. CYC2008 is
a comprehensive catalogue of 408 manually curated yeast 
protein complexes [12]. In the catalogue, 172 complexes 
(42%) are heterodimeric, and 87 complexes (21%) are het- 
erotrimeric as reported also in [13]. In our previous study, 
hence, we developed a method using our proposed kernel 
for predicting heterodimeric protein complexes [14], 
which outperforms an existing method using the naive 
Bayes classifier [15]. In this paper, we propose prediction 
methods for heterotrimeric protein complexes by extend- 
ing techniques in our previous method on the basis of the 
idea that heterotrimeric protein complexes are not likely 
to share the same protein with other heterotrimeric
protein complexes. For that purpose, we apply supervised
learning methods, such as the support vector machine
(SVM) [16] and the relevance vector machine (RVM) [17],
in two successive phases.
Tatsuke and Maruyama developed the proteins' partition 
sampler (PPSampler) method based on the Metropolis- 
Hastings algorithm, which generates clusters whose sizes 
follow a power-law distribution, and outperforms other 
existing methods in F-measure for whole protein com- 
plexes [13]. For prediction of heterotrimeric protein com- 
plexes, they reported that the F-measure of NWE was 
better than those of the existing methods, MCL, MCODE, 
DPClus, CMC, COACH, RRW, and PPSampler. We per- 
form 10-fold cross-validation, and calculate the average 
F-measure. The results suggest that our proposed methods 
outperform the existing method NWE. 

Methods 

In this section, we propose prediction methods for
heterotrimeric protein complexes. More precisely, we consider
the following problem: given a network of protein-protein
interactions weighted by some reliability, determine
whether or not three distinct proteins that are connected
in the protein-protein interaction network form a protein
complex. Let G(V, E) be an undirected graph with a set V
of vertices and a set E of edges, representing the
protein-protein interaction network. Here, a vertex represents a
protein, an edge (i, j) represents an interaction between
proteins $P_i$ and $P_j$, and the weight $w_{ij}$ represents the reliability
and strength of the interaction between $P_i$ and $P_j$. In this
paper, we use the weights in the WI-PHI database [1] as edge weights,
which have been calculated from heterogeneous biological
experimental data. We call $P_i$ a neighboring protein to $P_j$ if
$(i, j) \in E$. Our proposed methods use the support
vector machine (SVM), its discriminant function, and the
relevance vector machine (RVM).
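The problem setting above can be made concrete with a small sketch that stores the weighted network as an adjacency map and enumerates every set of three distinct proteins that is connected in the network. The helper names (`build_adjacency`, `connected_triples`) and the toy weights are hypothetical; the sketch relies only on the observation that a connected set of three vertices always contains one vertex adjacent to the other two.

```python
from itertools import combinations


def build_adjacency(weights):
    """Turn {(u, v): w} into a symmetric adjacency map, skipping self interactions."""
    adj = {}
    for (u, v), w in weights.items():
        if u == v:
            continue
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def connected_triples(adj):
    """Return all 3-protein sets whose induced subgraph is connected."""
    triples = set()
    for center, neighbors in adj.items():
        # Pick any two neighbors of `center`; the three proteins are then connected.
        for a, b in combinations(sorted(neighbors), 2):
            triples.add(frozenset((center, a, b)))
    return triples


# Hypothetical toy weights (not real WI-PHI values).
toy_weights = {("P1", "P2"): 30.0, ("P2", "P3"): 25.0, ("P3", "P4"): 10.0}
adj = build_adjacency(toy_weights)
triples = connected_triples(adj)
```

Each element of `triples` is a candidate for which the classifiers have to decide whether it forms a heterotrimeric complex.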

Support and relevance vector machine 

We briefly review the support and relevance vector
machines [16,17]. Suppose that N training data $\{x_i, t_i\}_{i=1}^{N}$
with target $t_i \in \{-1, 1\}$ are given. For our purpose, $x_i$
corresponds to a set of three distinct proteins, and $t_i = 1$
corresponds to the case that the set forms a heterotrimeric
protein complex. Then, we consider linear models
represented by the form

$$y(x) = \sum_{i=1}^{M} a_i \phi_i(x) + b, \tag{1}$$

where $\phi_i$ denotes a basis function, $M$ denotes the number
of basis functions, $a_i$ denotes the coefficient, and $b$
denotes the bias parameter. In the SVM, $\phi_i(x)$ is implicitly
defined as $K(x_i, x)$ with a positive semidefinite kernel
function $K$, $M$ is equal to $N$, and $a_i$ and $b$ are determined by
maximizing the margin. New sets $x$ of proteins are
classified according to the sign of $y(x)$. We make use of this
discriminant function $y(x)$ in our proposed methods.
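As an illustration of Eq. (1) in the SVM case, the sketch below evaluates the discriminant value as a kernel expansion over the training examples. It is not the libsvm implementation; the function names, the linear kernel, and the toy coefficients are assumptions made for the example, and a discriminant value of exactly zero is arbitrarily assigned to the positive class.

```python
def linear_kernel(u, v):
    """Inner product of two feature vectors (an assumed, illustrative kernel)."""
    return sum(a * b for a, b in zip(u, v))


def discriminant(x, train_x, coeffs, bias, kernel=linear_kernel):
    """Evaluate y(x) = sum_i a_i K(x_i, x) + b."""
    return sum(a * kernel(x_i, x) for a, x_i in zip(coeffs, train_x)) + bias


def classify(x, train_x, coeffs, bias, kernel=linear_kernel):
    """Classify by the sign of y(x); zero is treated as positive by assumption."""
    return 1 if discriminant(x, train_x, coeffs, bias, kernel) >= 0 else -1


# Hypothetical trained model with two training vectors.
train_x = [(1.0, 0.0), (0.0, 1.0)]
coeffs = [0.5, -0.5]
bias = 0.1
y_value = discriminant((2.0, 1.0), train_x, coeffs, bias)
label = classify((0.0, 3.0), train_x, coeffs, bias)
```

The real-valued `y_value`, rather than only its sign, is what the second phase of the proposed methods reuses.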

The RVM is a Bayesian sparse kernel technique for
classification and regression, and shares some characteristics
of the SVM. As in the SVM, the basis functions of the
RVM are given by kernels, although they are not required to be
positive semidefinite. It is known, however, that the training
time of the RVM is in general longer than that of the
SVM. In the RVM, a hyperparameter $\gamma_i$ for each parameter
$a_i$ and a prior distribution over the parameters $a_i$ are
introduced to obtain a sparse model. For classification, the
model in Eq. (1) is transformed as $\sigma(y(x))$, where $\sigma(y)$
denotes the logistic sigmoid function $1/(1 + e^{-y})$, and $a_i$
and $b$ are determined by maximizing the marginal
log-likelihood with respect to $\gamma$.

Extension of feature space mapping 

In our previous study, we proposed seven feature space
mappings for prediction of heterodimeric protein
complexes [14]. These are based on the idea that the
reliability of the interaction in a heterodimer should be
high and, conversely, the reliability of the interaction
between a protein in a heterodimer and a protein not in
the heterodimer should be low. We extend the feature
space mappings for two interacting proteins to
mappings for three proteins. Table 1 shows the
extended mappings for three distinct proteins $P_i$, $P_j$, and
$P_k$ that are connected in the protein-protein interaction
network. Here, the fifth mapping in the previous study is
eliminated because, with more neighboring proteins,
the maximum of differences becomes close to the maximum of
neighboring weights denoted by (F3). (F1) and (F2)
denote the maximum and minimum of the weights of
interactions among $P_i$, $P_j$, and $P_k$, respectively. The
first feature in the previous study is the weight of the
interaction between two proteins. Since there are at
least two interactions among the three focused proteins and we
cannot use all the weights directly as elements of our feature
vector, we take the maximum and
minimum of the weights (see Figure 1). In addition, the
proteins in a heterotrimer should interact with each other,
and (F2), which is the minimum of the weights, is expected
to be high. (F3) and (F4) denote the maximum and
minimum of the weights of interactions between any of
$P_i$, $P_j$, $P_k$ and a neighboring protein $P_r$, respectively, where
$r \neq i, j, k$ and $(i, r) \in E$, $(j, r) \in E$, or $(k, r) \in E$. It is
considered that (F3), which is the maximum of the neighboring
weights of a heterotrimer, should be lower than the
weights of interactions in the heterotrimer. Consider the
case that a protein $P_r$ interacts with two of the proteins $P_i$, $P_j$,
and $P_k$, where $P_r$ is not any of $P_i$, $P_j$, and $P_k$ (see Figure 1).
If the weights of both interactions are large, these
proteins including $P_r$ may form a complex. We therefore introduce
the maximum of the smaller weights of interactions with
neighboring proteins $P_r$, denoted by (F5). (F6) and (F7)
denote the maximum and the minimum of the numbers
of domains contained in $P_i$, $P_j$, and $P_k$, respectively. The
number of domains in a protein complex is expected to
be large because domains are considered as mediators
of protein-protein interactions.



Table 1 Feature space mapping from three distinct
proteins $P_i$, $P_j$, $P_k$.

(F1)  $\max_{\{(p,q) \in E \mid p, q \in \{i,j,k\}\}} w_{pq}$
(F2)  $\min_{\{(p,q) \in E \mid p, q \in \{i,j,k\}\}} w_{pq}$
(F3)  $\max_{\{(p,r) \in E \mid p \in \{i,j,k\},\, r \notin \{i,j,k\}\}} w_{pr}$
(F4)  $\min_{\{(p,r) \in E \mid p \in \{i,j,k\},\, r \notin \{i,j,k\}\}} w_{pr}$
(F5)  $\max_{\{(p,r),(q,r) \in E \mid p, q \in \{i,j,k\},\, p \neq q,\, r \notin \{i,j,k\}\}} \min\{w_{pr}, w_{qr}\}$
(F6)  max{# domains of $P_i$, # domains of $P_j$, # domains of $P_k$}
(F7)  min{# domains of $P_i$, # domains of $P_j$, # domains of $P_k$}

Figure 1 Example of a subgraph including three focused
proteins $P_i$, $P_j$, $P_k$ and their neighboring proteins. In this
example, protein $P_r$ is neighboring to both of $P_i$ and $P_j$.
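The seven mappings (F1) to (F7) can be sketched as a single function over an adjacency map. This is an illustrative reading of Table 1, not the authors' code: the names `build_adjacency` and `trimer_features`, the toy weights, and the domain counts are hypothetical, and returning 0.0 when a triple has no neighboring protein is an assumption, since the text does not state how that case is handled. The triple is assumed to be connected, so at least one inner weight exists.

```python
def build_adjacency(weights):
    """Turn {(u, v): w} into a symmetric adjacency map."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def trimer_features(triple, adj, n_domains):
    """Return [F1, ..., F7] for a connected set of three proteins."""
    members = set(triple)
    # Weights of interactions inside the triple (each edge counted once).
    inner = [w for p in members for q, w in adj.get(p, {}).items()
             if q in members and p < q]
    # Weights of interactions between a member and an outside protein P_r.
    outer = [w for p in members for r, w in adj.get(p, {}).items()
             if r not in members]
    # For an outside protein touching two or more members, the best pair (p, q)
    # yields min{w_pr, w_qr} equal to its second-largest weight to the triple.
    outside = {r for p in members for r in adj.get(p, {}) if r not in members}
    shared = []
    for r in outside:
        ws = sorted((adj[p][r] for p in members if r in adj.get(p, {})),
                    reverse=True)
        if len(ws) >= 2:
            shared.append(ws[1])
    counts = [n_domains.get(p, 0) for p in members]
    return [
        max(inner), min(inner),                               # (F1), (F2)
        max(outer, default=0.0), min(outer, default=0.0),     # (F3), (F4)
        max(shared, default=0.0),                             # (F5)
        float(max(counts)), float(min(counts)),               # (F6), (F7)
    ]


# Hypothetical toy data (not real WI-PHI weights or domain annotations).
weights = {("P1", "P2"): 30.0, ("P2", "P3"): 25.0, ("P1", "P3"): 20.0,
           ("P1", "P4"): 5.0, ("P2", "P4"): 8.0, ("P3", "P5"): 3.0}
adj = build_adjacency(weights)
features = trimer_features(("P1", "P2", "P3"), adj, {"P1": 2, "P2": 1, "P3": 3})
```

In this toy case the inner weights (20 to 30) are well above the largest neighboring weight (8), which is the pattern the features are designed to capture for a true heterotrimer.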



In addition to the extended features, we examine the
domain composition kernel developed in our previous
study [14]. We defined equivalence $=_d$ between two
proteins $P_i$ and $P_j$ as the condition that $P_i$ consists of the
same domains as $P_j$, and defined equivalence $=_c$
between two sets $x_i$ and $x_j$ that consist of $\{P_{i_1}, \ldots, P_{i_n}\}$
and $\{P_{j_1}, \ldots, P_{j_n}\}$, respectively, as
$\exists \sigma \in \mathfrak{S}_n \, \forall k \, (P_{i_k} =_d P_{j_{\sigma(k)}})$,
where $\mathfrak{S}_n$ denotes the symmetric group of degree $n$ on
the set $\{1, \ldots, n\}$. Then, the domain composition kernel
$K_c$ was defined by

$$K_c(x_i, x_j) = \begin{cases} 1 & (\text{if } x_i =_c x_j), \\ 0 & (\text{otherwise}). \end{cases} \tag{2}$$
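A minimal sketch of the domain composition kernel in Eq. (2) follows. Checking for the existence of a matching permutation is equivalent to comparing the sorted lists of per-protein domain compositions, which is what the code does. Representing each protein's composition as a sorted tuple of domain identifiers (so that repeated domains are counted) is an assumption, as are the function name and the identifiers `PF1` to `PF5`.

```python
def domain_composition_kernel(x_i, x_j, domains):
    """Return 1 if the two protein sets match domain-wise up to a permutation, else 0."""
    if len(x_i) != len(x_j):
        return 0
    # Canonical form: one sorted tuple of domain IDs per protein, then sort the list.
    comp_i = sorted(tuple(sorted(domains.get(p, ()))) for p in x_i)
    comp_j = sorted(tuple(sorted(domains.get(p, ()))) for p in x_j)
    return 1 if comp_i == comp_j else 0


# Hypothetical domain annotations.
domains = {
    "A": ["PF1"], "B": ["PF2", "PF3"], "C": ["PF4"],
    "D": ["PF1"], "E": ["PF3", "PF2"], "F": ["PF4"],
    "G": ["PF5"],
}
k_same = domain_composition_kernel(("A", "B", "C"), ("F", "E", "D"), domains)
k_diff = domain_composition_kernel(("A", "B", "C"), ("D", "E", "G"), domains)
```

Under this sketch, proteins without any annotated domain all share the empty composition and are therefore treated as equivalent to each other; whether the original method does the same is not stated in the text.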



Two-phase learning approach 

Our proposed methods take a two-phase learning
approach. The basic idea for designing our methods is
that heterotrimeric protein complexes are not likely to
share the same protein with other heterotrimeric protein
complexes. In the first phase, we estimate the model parameters
of an SVM using the training data, and predict
whether or not each training example and each of its neighboring
sets, which share at least one protein with the training example,
is a heterotrimeric protein complex.
Then, the second-phase predictor makes use of the
discriminant values obtained by the first-phase predictor. If
heterotrimeric protein complexes do not share the same protein,
the discriminant values for a target set
of proteins and for its neighboring sets are not expected to
become large together.

Suppose that the training data set comprises N sets $x_i$
of three distinct proteins with the corresponding label
$t_i \in \{-1, 1\}$. For each $x_i$, we calculate the 7-dimensional
feature vector $f^{(1)}(x_i)$ using (F1),...,(F7) shown in Table 1
and the kernel matrix whose (i, j)-th element is
$\langle f^{(1)}(x_i), f^{(1)}(x_j) \rangle + \alpha K_c(x_i, x_j)$, where $\alpha$ is a constant and $\langle \cdot, \cdot \rangle$
denotes the inner product. Then, we obtain the model
parameters in Eq. (1) by applying the SVM to the
training data set. Let $\mathcal{N}(x)$ be all sets of three distinct
proteins that are neighboring to $x$ and connected in the
protein-protein interaction network, where we call $x_i$ a
neighboring set to $x_j$ if $x_i$ and $x_j$ share the same protein
and $x_i$ is not $x_j$ (see Figure 2). For each $x_i$, we calculate
the discriminant values $y(x_i)$ and $y(x)$ for all $x \in \mathcal{N}(x_i)$.
Since the discriminant values may include outliers, by
taking the averages of positive and negative discriminant
values separately, we define four feature space mappings
for $x_i$,



$$f^{(2s)}(x_i) = y(x_i), \tag{3}$$

$$f^{(2p)}(x_i) = \frac{1}{|\{x \in \mathcal{N}(x_i) \mid y(x) > 0\}|} \sum_{\{x \in \mathcal{N}(x_i) \mid y(x) > 0\}} y(x), \tag{4}$$

$$f^{(2n)}(x_i) = \frac{1}{|\{x \in \mathcal{N}(x_i) \mid y(x) < 0\}|} \sum_{\{x \in \mathcal{N}(x_i) \mid y(x) < 0\}} y(x), \tag{5}$$

$$f^{(2a)}(x_i) = \frac{1}{|\mathcal{N}(x_i)|} \sum_{x \in \mathcal{N}(x_i)} y(x), \tag{6}$$

where $|S|$ denotes the number of elements in the set $S$.
Here, we define $f^{(2p)}(x_i) = 0$ ($f^{(2n)}(x_i) = 0$, $f^{(2a)}(x_i) = 0$) if
$|\{x \in \mathcal{N}(x_i) \mid y(x) > 0\}| = 0$ ($|\{x \in \mathcal{N}(x_i) \mid y(x) < 0\}| = 0$, $|\mathcal{N}(x_i)| = 0$).
We compose the 11-dimensional feature vector $f^{(2)}(x_i)$ using
$f^{(1)}$, $f^{(2s)}$, $f^{(2p)}$, $f^{(2n)}$, and $f^{(2a)}$, calculate the kernel matrix
with the (i, j)-th element $\langle f^{(2)}(x_i), f^{(2)}(x_j) \rangle + \alpha K_c(x_i, x_j)$,
and apply a supervised learning method. It should
be noted that our methods use only training data to
estimate model parameters. For test data $x$, we calculate
$\langle f^{(2)}(x_i), f^{(2)}(x) \rangle + \alpha K_c(x_i, x)$ for training data $x_i$, and
determine whether or not $x$ is a heterotrimeric protein
complex according to the second classifier.

Figure 2 Example of a subgraph including a focused set of
proteins and neighboring sets of proteins. Each neighboring set
of three proteins shares at least one protein with the focused set
(black circle). In this example, set $S_1$ of three proteins shares two
proteins with the focused set, and $S_2$, $S_3$ share one protein,
respectively.
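The second-phase features and the combined kernel entry can be sketched as follows. This assumes the reading in which the four mappings are the discriminant value of the set itself, the mean of the positive discriminant values of its neighboring sets, the mean of the negative ones, and the mean over all neighboring sets, each defined as 0 when the corresponding collection is empty. The function names and the toy discriminant values are hypothetical.

```python
def mean_or_zero(values):
    """Average of `values`, or 0.0 when the list is empty (as defined in the text)."""
    return sum(values) / len(values) if values else 0.0


def second_phase_features(y_self, y_neighbors):
    """Return [f2s, f2p, f2n, f2a] from first-phase discriminant values."""
    positives = [y for y in y_neighbors if y > 0]
    negatives = [y for y in y_neighbors if y < 0]
    return [
        y_self,                     # discriminant value of the focused set
        mean_or_zero(positives),    # mean of positive neighbor values
        mean_or_zero(negatives),    # mean of negative neighbor values
        mean_or_zero(y_neighbors),  # mean over all neighboring sets
    ]


def combined_kernel(f_i, f_j, k_c, alpha=0.5):
    """Kernel entry <f_i, f_j> + alpha * K_c, with K_c in {0, 1}."""
    return sum(a * b for a, b in zip(f_i, f_j)) + alpha * k_c


# Hypothetical discriminant values for one set and its four neighboring sets.
phase2 = second_phase_features(1.2, [0.5, -1.0, -2.0, 1.5])
entry = combined_kernel([1.0, 2.0], [3.0, 4.0], k_c=1, alpha=0.5)
```

Appending `phase2` to the 7-dimensional first-phase vector gives the 11-dimensional vector used by the second classifier.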

Computational experiments 

Data and implementation 

To evaluate our proposed methods, we performed
computational experiments and compared them with the
existing method NWE [5]. As input weights of interactions,
we used the WI-PHI database [1], which contains 49607
interacting protein pairs excluding self interactions and
is available at the supporting information web
page of the paper. The weights were obtained from
high-throughput yeast two-hybrid data [18,19] and
several biological databases such as BioGRID [2] and BIND
[20] by applying a log-likelihood score (LLS) to each
dataset and the socioaffinity (SA) index [21], which measures
the log-odds of the number of times that two
proteins are observed to interact relative to the value expected
from the dataset.

We prepared datasets using heterotrimeric protein
complexes in the CYC2008 protein complex catalogue [12],
which contains 87 heterotrimeric protein complexes
and is available at http://wodaklab.org/cyc2008/. We
restricted positive and negative examples to sets of three
distinct proteins that form a single connected
component in the input protein-protein interaction network.
Thus, 7 heterotrimers were eliminated, and we used 80
heterotrimers as positive examples. For negative
examples, we extracted 32647 sets of three proteins included
in CYC2008 protein complexes with size more than three,
and we selected 100 distinct
examples at random from these sets because our methods require
many neighboring sets of three proteins for each example
in the second phase. Negative examples
selected from such sets are considered more difficult to
classify than those selected from all sets of three
proteins except heterotrimers.

For NWE, we set the options related to the size of
complexes so that NWE outputs protein complexes with
size two or more from the WI-PHI protein-protein
interaction network in the same way as [13], and extracted
only protein complexes with size three from the result.

For measuring the performance, we used accuracy,
precision, recall, and F-measure defined by

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7}$$

$$\text{precision} = \frac{TP}{TP + FP}, \tag{8}$$

$$\text{recall} = \frac{TP}{TP + FN}, \tag{9}$$

$$\text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \tag{10}$$

where TP, TN, FP, and FN mean the number of true positive,
true negative, false positive, and false negative examples, respectively.
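The measures above can be sketched in a few lines. As a worked check, the example uses the counts reported for NWE in the Results (54 predicted complexes of size three, 19 of them true heterotrimers, and 87 known heterotrimers in CYC2008), for which precision and recall are TP divided by the numbers of predicted and known heterotrimers. The function names are hypothetical, and returning 0.0 for an empty denominator is an assumption.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision_recall_f(tp, n_predicted, n_known):
    """Precision, recall, and F-measure from TP and the two totals.

    n_predicted = TP + FP (predicted positives), n_known = TP + FN (true positives).
    """
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_known if n_known else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Counts reported for NWE: 54 predicted trimers, 19 correct, 87 known heterotrimers.
nwe_precision, nwe_recall, nwe_f = precision_recall_f(19, 54, 87)
nwe_rounded = tuple(round(v, 3) for v in (nwe_precision, nwe_recall, nwe_f))
```

Rounded to three decimals, these values agree with the NWE column of Table 2 (0.352, 0.218, 0.270).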

We used 'libsvm' (version 3.11) [22] and 'SparseBayes' 
package (version 2.0) [23] as implementations of SVM 
and RVM, respectively. 

Results 

We performed 10-fold cross-validation, and took the
average of accuracy, precision, recall, and F-measure.
Furthermore, we repeated this procedure 10 times for
other datasets with randomly selected negative
examples, and took the average. Table 2 shows the
average accuracy, precision, recall, and F-measure
of our proposed methods and NWE. 'SVM+SVM' and
'SVM+RVM' denote two-phase methods using SVM and
RVM as the second classifier, respectively. 'SVM'
denotes the usual SVM using only features $f^{(1)}$, and $\alpha$ denotes
the coefficient of the domain composition kernel $K_c$. We
examined $\alpha = 0.5$ because this value was best for
prediction of heterodimeric protein complexes in our previous
study [14]. NWE predicted 54 protein complexes with
size three from the WI-PHI protein-protein interaction
network, and 19 of them were actual heterotrimeric
protein complexes in the CYC2008 protein complex
catalogue. We can see from the table that the F-measures of
SVM+SVM, SVM+RVM, and SVM for both $\alpha = 0$ and $0.5$
were higher than that of NWE.
Furthermore, the accuracy and F-measure of the two-phase
method SVM+SVM were higher than those of the usual
SVM with $f^{(1)}$. The accuracy and
F-measure of SVM+RVM, however, were lower than those of
SVM. This implies that RVMs may be less
useful than SVMs for problems to which SVMs can be
applied. Thus, the results suggest that our proposed
methods SVM+SVM, SVM+RVM, and SVM outperform
the existing method NWE. The results also suggest the
usefulness of the second phase.

Table 2 Results on the average of accuracy, precision,
recall, and F-measure by our proposed methods and
NWE.

             SVM+SVM          SVM+RVM          SVM              NWE
$\alpha$     0       0.5      0       0.5      0       0.5      -
accuracy     0.885   0.907    0.810   0.853    0.861   0.876    -
precision    0.936   0.869    0.847   0.899    0.909   0.873    0.352
recall       0.840   0.926    0.770   0.766    0.819   0.862    0.218
F-measure    0.880   0.891    0.767   0.810    0.854   0.862    0.270

'SVM+SVM' and 'SVM+RVM' denote two-phase methods using SVM and RVM
as the second classifier, respectively. 'SVM' denotes the usual SVM using only
features $f^{(1)}$, and $\alpha$ denotes the coefficient of the domain composition kernel $K_c$.
Note that the accuracy is not defined for NWE because it is unsupervised and
predicts protein complexes of various sizes. The precision and recall for NWE
were calculated as TP divided by the numbers of predicted and known
heterotrimers, respectively.

Conclusions 

We proposed prediction methods by two-phase learning 
for heterotrimeric protein complexes. In the methods, 
we extended the feature space mappings in our previous 
study for prediction of heterodimeric protein com- 
plexes, and made use of the discriminant function for 
neighboring sets of three proteins. To validate our pro- 
posed methods, we performed 10-fold cross-validation 
computational experiments. The results suggest that 
our two-phase prediction methods and SVM with the 
extended features outperform the existing method 
NWE, which was reported to outperform many other 
existing methods such as MCL, MCODE, DPClus, 
CMC, COACH, RRW, and PPSampler, although our 
methods are limited to prediction of heterotrimeric pro- 
tein complexes. For further evaluation, we would like to 
perform computational experiments for other datasets if 
such data become available. 

There is room to further improve the
prediction accuracy. For instance, we can use sequence
information, as well as the domains contained in proteins,
for designing feature space mappings. In addition, we can
introduce a probabilistic model such as conditional random
fields or Markov random fields over neighboring sets of three
proteins, although in this paper we considered kernels
between neighboring sets.